Investigate app usage and provide recommendations which will allow the Norbu app increase its Retention rate on 28 day from 4% to 20%.
What proportion of those participating in the test survey have sleep problems, in which countries, cities? What does the Device model affect
Calculate user LTV all Calculate user LTV / GEO Calculate user LTV / Traffic source (medium) Calculate user LTV / Platform Calculate user LTV revenue What affects LTV?
Retention 1,3,7,30 day / Platform Retention 1,3,7,30 day / GEO What affects Retention?
What affects the removal (app remove)?
What affects the in-app purchase / product ID What affects the renewal of the subscriptions? What affects user engagment?
Derive and analyze patterns of User journey behavior Correlations Stress evaluation / Pulse-stress-test
How clear and easy to learn is the MBSC mindfulness meditation level 1
% started 5-day-unlock a traning % completed 5-day-unlock a traning % user started 5-day-unlock a traning -> not completed this unlock -> made purchase of this training
# !pip install pandas_gbq
# !pip install google-auth==1.8.2
# pip install google-cloud-core==1.5.0
# import sys
# !conda install --yes --prefix {sys.prefix} plotly
# importing libraries
from pandas.io import gbq
from google.oauth2 import service_account
import pandas_gbq
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from plotly import graph_objects as go
import plotly.express as px
# import sys
# !conda install --yes --prefix {sys.prefix} matplotlib
# !conda install --yes --prefix {sys.prefix} seaborn
projectid="norbu-app-6b798"
# pandas_gbq.read_gbq('SELECT event_timestamp FROM `norbu-app-6b798.firebase_messaging.data` LIMIT 100', project_id=projectid)
# pandas_gbq.read_gbq('SELECT * FROM `norbu-app-6b798.firebase_messaging.data` LIMIT 10', project_id=projectid)
# pandas_gbq.read_gbq('SELECT event, COUNT(event) FROM `norbu-app-6b798.firebase_messaging.data` GROUP BY event', project_id=projectid)
# sql = """
# SELECT event_date, event_timestamp, event_name, event_previous_timestamp, event_value_in_usd, event_bundle_sequence_id, event_server_timestamp_offset, user_id, user_pseudo_id
# FROM `norbu-app-6b798.analytics_201683422.events_*`
# """
# df = pandas_gbq.read_gbq(sql, project_id=projectid)
# df.to_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/events.pkl')
events = pd.read_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/events.pkl')
# sql = """
# SELECT user_pseudo_id, user_first_touch_timestamp, user_ltv FROM `norbu-app-6b798.analytics_201683422.events_*`
# """
# df = pandas_gbq.read_gbq(sql, project_id=projectid)
# df.to_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/users1.pkl')
users = pd.read_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/users.pkl')
# sql = """
# SELECT user_pseudo_id, device FROM `norbu-app-6b798.analytics_201683422.events_*`
# """
# df = pandas_gbq.read_gbq(sql, project_id=projectid)
# df.to_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/devices.pkl')
devices = pd.read_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/devices.pkl')
# sql = """
# SELECT user_pseudo_id, geo FROM `norbu-app-6b798.analytics_201683422.events_*`
# """
# df = pandas_gbq.read_gbq(sql, project_id=projectid)
# df.to_pickle ('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/geo.pkl')
geo = pd.read_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/geo.pkl')
# sql = """
# SELECT user_pseudo_id, traffic_source FROM `norbu-app-6b798.analytics_201683422.events_*`
# """
# df = pandas_gbq.read_gbq(sql, project_id=projectid)
# df.to_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/traffic.pkl')
traffic = pd.read_pickle('/Users/elizavetagorbunova/Downloads/Study/DA_Yandex/Norbu/traffic.pkl')
# function to get basic info about the dataset
def get_info(df):
print("Head:")
display(df.head())
print()
print("Info:")
display(df.info())
print()
print("Description of quantitative parameters:")
display(df.describe())
print()
print("Description of categorical parameters:")
display(df.describe(include='object'))
print()
print("Columns with NaN values:")
display(df.isna().sum())
print()
print("Shape:")
display(df.shape)
print()
print("Number of duplicated rows:")
display(df.duplicated().sum())
events.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5756075 entries, 0 to 5756074 Data columns (total 9 columns): # Column Dtype --- ------ ----- 0 event_date object 1 event_timestamp int64 2 event_name object 3 event_previous_timestamp float64 4 event_value_in_usd float64 5 event_bundle_sequence_id int64 6 event_server_timestamp_offset int64 7 user_id object 8 user_pseudo_id object dtypes: float64(2), int64(3), object(4) memory usage: 395.2+ MB
traffic.head()
| user_pseudo_id | traffic_source | |
|---|---|---|
| 0 | 0cbfacb17dac3d66a049c15a243cbcf3 | {'name': 'instagram', 'medium': 'insta', 'sour... |
| 1 | 0cbfacb17dac3d66a049c15a243cbcf3 | {'name': 'instagram', 'medium': 'insta', 'sour... |
| 2 | 0cbfacb17dac3d66a049c15a243cbcf3 | {'name': 'instagram', 'medium': 'insta', 'sour... |
| 3 | 0cbfacb17dac3d66a049c15a243cbcf3 | {'name': 'instagram', 'medium': 'insta', 'sour... |
| 4 | 0cbfacb17dac3d66a049c15a243cbcf3 | {'name': 'instagram', 'medium': 'insta', 'sour... |
get_info(events)
Head:
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_id | user_pseudo_id | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20210606 | 1622987150469711 | scr_breathe_home | 1.617608e+15 | NaN | 30 | 874711 | 9ca217af-8bbf-462a-bdd2-844b5e335b33 | deb456bc2ea690aa5544c7f056e46747 |
| 1 | 20210606 | 1622987152698712 | scr_breathe | 1.617608e+15 | NaN | 30 | 874711 | 9ca217af-8bbf-462a-bdd2-844b5e335b33 | deb456bc2ea690aa5544c7f056e46747 |
| 2 | 20210606 | 1622987163806713 | scr_breathe_home | 1.622987e+15 | NaN | 30 | 874711 | 9ca217af-8bbf-462a-bdd2-844b5e335b33 | deb456bc2ea690aa5544c7f056e46747 |
| 3 | 20210606 | 1622987163809714 | scr_AssetsLoaderPage | 1.622987e+15 | NaN | 30 | 874711 | 9ca217af-8bbf-462a-bdd2-844b5e335b33 | deb456bc2ea690aa5544c7f056e46747 |
| 4 | 20210606 | 1622987163815715 | scr_training_details | 1.622987e+15 | NaN | 30 | 874711 | 9ca217af-8bbf-462a-bdd2-844b5e335b33 | deb456bc2ea690aa5544c7f056e46747 |
Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5756075 entries, 0 to 5756074 Data columns (total 9 columns): # Column Dtype --- ------ ----- 0 event_date object 1 event_timestamp int64 2 event_name object 3 event_previous_timestamp float64 4 event_value_in_usd float64 5 event_bundle_sequence_id int64 6 event_server_timestamp_offset int64 7 user_id object 8 user_pseudo_id object dtypes: float64(2), int64(3), object(4) memory usage: 395.2+ MB
None
Description of quantitative parameters:
| event_timestamp | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | |
|---|---|---|---|---|---|
| count | 5.756075e+06 | 4.078767e+06 | 225.000000 | 5.756075e+06 | 5.756075e+06 |
| mean | 1.620188e+15 | 1.619921e+15 | 5.846177 | 3.301320e+01 | 6.682094e+08 |
| std | 1.493431e+12 | 3.510497e+12 | 8.262899 | 1.588862e+02 | 3.477280e+11 |
| min | 1.617571e+15 | 2.125570e+08 | 0.000000 | 1.000000e+00 | 0.000000e+00 |
| 25% | 1.618895e+15 | 1.618747e+15 | 0.000000 | 4.000000e+00 | 4.676990e+05 |
| 50% | 1.620080e+15 | 1.619966e+15 | 2.490000 | 7.000000e+00 | 1.242702e+06 |
| 75% | 1.621433e+15 | 1.621327e+15 | 4.840662 | 1.500000e+01 | 2.216415e+06 |
| max | 1.623000e+15 | 2.119286e+15 | 34.512240 | 2.100000e+03 | 2.023331e+14 |
Description of categorical parameters:
| event_date | event_name | user_id | user_pseudo_id | |
|---|---|---|---|---|
| count | 5756075 | 5756075 | 958524 | 5756075 |
| unique | 63 | 181 | 15213 | 83884 |
| top | 20210419 | scr_home | 7769a3cb-9ded-4c2d-8d19-0b448534eb11 | ededd30e5bf41b0df785f14373c06b58 |
| freq | 121758 | 622670 | 3164 | 39906 |
Columns with NaN values:
event_date 0 event_timestamp 0 event_name 0 event_previous_timestamp 1677308 event_value_in_usd 5755850 event_bundle_sequence_id 0 event_server_timestamp_offset 0 user_id 4797551 user_pseudo_id 0 dtype: int64
Shape:
(5756075, 9)
Number of duplicated rows:
22068
#events['event_name'].unique()
# drop user_id as user_pseudo_id is enough
events = events.drop(columns=['user_id'])
from datetime import datetime
# change data types
events['event_date'] = pd.to_datetime(events['event_date'], format='%Y%m%d')
events['event_timestamp'] = pd.to_datetime(events['event_timestamp'], unit='us').dt.floor('s')
events['event_previous_timestamp'] = pd.to_datetime(events['event_previous_timestamp'], unit='us').dt.floor('s')
# change NaN to 0
events['event_value_in_usd']=events['event_value_in_usd'].fillna(value='0')
events.head()
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_pseudo_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-06 | 2021-06-06 13:45:50 | scr_breathe_home | 2021-04-05 07:27:21 | 0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 1 | 2021-06-06 | 2021-06-06 13:45:52 | scr_breathe | 2021-04-05 07:27:22 | 0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 2 | 2021-06-06 | 2021-06-06 13:46:03 | scr_breathe_home | 2021-06-06 13:45:49 | 0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 3 | 2021-06-06 | 2021-06-06 13:46:03 | scr_AssetsLoaderPage | 2021-06-06 13:45:41 | 0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 4 | 2021-06-06 | 2021-06-06 13:46:03 | scr_training_details | 2021-06-06 13:45:39 | 0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
# change NaN for event_previous timestamp to event_timestamp
events['event_previous_timestamp']=events['event_previous_timestamp'].fillna(value=events['event_timestamp'])
events['event_previous_timestamp'].min()
Timestamp('1970-01-01 00:03:32')
events['event_previous_timestamp'].max()
Timestamp('2037-02-26 18:41:00')
# let's change strange values of event_previous_timestamp to event_timestamp
min_date = events['event_timestamp'].min()
max_date = events['event_timestamp'].max()
events.loc[events['event_previous_timestamp'] < min_date, 'event_previous_timestamp'] = events['event_timestamp']
events.loc[events['event_previous_timestamp'] > max_date, 'event_previous_timestamp'] = events['event_timestamp']
events['event_previous_timestamp'].min()
Timestamp('2021-04-04 21:21:17')
events['event_previous_timestamp'].max()
Timestamp('2021-06-06 17:16:48')
events=events.drop_duplicates()
events['event_value_in_usd']=events['event_value_in_usd'].fillna(value=0)
events['event_value_in_usd']=events['event_value_in_usd'].astype('float64')
get_info(events)
Head:
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_pseudo_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-06 | 2021-06-06 13:45:50 | scr_breathe_home | 2021-04-05 07:27:21 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 1 | 2021-06-06 | 2021-06-06 13:45:52 | scr_breathe | 2021-04-05 07:27:22 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 2 | 2021-06-06 | 2021-06-06 13:46:03 | scr_breathe_home | 2021-06-06 13:45:49 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 3 | 2021-06-06 | 2021-06-06 13:46:03 | scr_AssetsLoaderPage | 2021-06-06 13:45:41 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 4 | 2021-06-06 | 2021-06-06 13:46:03 | scr_training_details | 2021-06-06 13:45:39 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
Info: <class 'pandas.core.frame.DataFrame'> Int64Index: 5710245 entries, 0 to 5756074 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 event_date datetime64[ns] 1 event_timestamp datetime64[ns] 2 event_name object 3 event_previous_timestamp datetime64[ns] 4 event_value_in_usd float64 5 event_bundle_sequence_id int64 6 event_server_timestamp_offset int64 7 user_pseudo_id object dtypes: datetime64[ns](3), float64(1), int64(2), object(2) memory usage: 392.1+ MB
None
Description of quantitative parameters:
| event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | |
|---|---|---|---|
| count | 5.710245e+06 | 5.710245e+06 | 5.710245e+06 |
| mean | 2.303561e-04 | 3.268779e+01 | 6.711165e+08 |
| std | 6.344244e-02 | 1.573004e+02 | 3.491205e+11 |
| min | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 |
| 25% | 0.000000e+00 | 4.000000e+00 | 4.659850e+05 |
| 50% | 0.000000e+00 | 7.000000e+00 | 1.240593e+06 |
| 75% | 0.000000e+00 | 1.500000e+01 | 2.213010e+06 |
| max | 3.451224e+01 | 2.100000e+03 | 2.023331e+14 |
Description of categorical parameters:
| event_name | user_pseudo_id | |
|---|---|---|
| count | 5710245 | 5710245 |
| unique | 181 | 83884 |
| top | scr_home | ededd30e5bf41b0df785f14373c06b58 |
| freq | 619776 | 38700 |
Columns with NaN values:
event_date 0 event_timestamp 0 event_name 0 event_previous_timestamp 0 event_value_in_usd 0 event_bundle_sequence_id 0 event_server_timestamp_offset 0 user_pseudo_id 0 dtype: int64
Shape:
(5710245, 8)
Number of duplicated rows:
0
len(users['user_pseudo_id'].unique())
83908
# replace NaN with '{}' if the column is strings, otherwise replace with {}
# df.Pollutants = df.Pollutants.fillna('{}') # if the NaN is in a column of strings
users.user_ltv = users.user_ltv.fillna({i: {} for i in users.index})
# if the column is not strings
# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True) # uncomment if needed
# normalize the column of dictionaries and join it to df
users = users.join(pd.json_normalize(users.user_ltv))
# drop Pollutants
users.drop(columns=['user_ltv'], inplace=True)
users['user_first_touch_timestamp'] = pd.to_datetime(events['event_timestamp'], unit='us').dt.floor('s')
get_info(users)
Head:
| user_pseudo_id | user_first_touch_timestamp | revenue | currency | |
|---|---|---|---|---|
| 0 | 948ba94877ac58a1cc65c519a2ece1f7 | 2021-06-06 13:45:50 | NaN | NaN |
| 1 | a625fb60fdf70d36529d3d7709b02517 | 2021-06-06 13:45:52 | NaN | NaN |
| 2 | fe0dc05bfa4086c4df499a482f889c9c | 2021-06-06 13:46:03 | NaN | NaN |
| 3 | 697897111de070f8ca650d14441051b6 | 2021-06-06 13:46:03 | NaN | NaN |
| 4 | 89f96008d6cbb7651cf0abf6f0bfbb7f | 2021-06-06 13:46:03 | NaN | NaN |
Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5758306 entries, 0 to 5758305 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 user_pseudo_id object 1 user_first_touch_timestamp datetime64[ns] 2 revenue float64 3 currency object dtypes: datetime64[ns](1), float64(1), object(2) memory usage: 175.7+ MB
None
Description of quantitative parameters:
| revenue | |
|---|---|
| count | 129688.000000 |
| mean | 11.226831 |
| std | 9.951014 |
| min | 0.000000 |
| 25% | 2.698778 |
| 50% | 8.224334 |
| 75% | 20.161548 |
| max | 101.887309 |
Description of categorical parameters:
| user_pseudo_id | currency | |
|---|---|---|
| count | 5758306 | 129688 |
| unique | 83908 | 1 |
| top | ededd30e5bf41b0df785f14373c06b58 | USD |
| freq | 39906 | 129688 |
Columns with NaN values:
user_pseudo_id 0 user_first_touch_timestamp 48061 revenue 5628618 currency 5628618 dtype: int64
Shape:
(5758306, 4)
Number of duplicated rows:
2247963
users.drop_duplicates(inplace=True)
users['revenue']=users['revenue'].fillna(value=0)
users['currency']=users['currency'].fillna(value=0)
users.head()
| user_pseudo_id | user_first_touch_timestamp | revenue | currency | |
|---|---|---|---|---|
| 0 | 948ba94877ac58a1cc65c519a2ece1f7 | 2021-06-06 13:45:50 | 0.0 | 0 |
| 1 | a625fb60fdf70d36529d3d7709b02517 | 2021-06-06 13:45:52 | 0.0 | 0 |
| 2 | fe0dc05bfa4086c4df499a482f889c9c | 2021-06-06 13:46:03 | 0.0 | 0 |
| 3 | 697897111de070f8ca650d14441051b6 | 2021-06-06 13:46:03 | 0.0 | 0 |
| 4 | 89f96008d6cbb7651cf0abf6f0bfbb7f | 2021-06-06 13:46:03 | 0.0 | 0 |
pd.set_option('display.max_colwidth', 1000)
devices.head()
| user_pseudo_id | device | |
|---|---|---|
| 0 | b01d6b0818c608ce3cb537401b84c5ae | {'category': 'mobile', 'mobile_brand_name': 'Samsung', 'mobile_model_name': 'SM-G981V', 'mobile_marketing_name': 'Galaxy S20 5G', 'mobile_os_hardware_model': 'SM-G981V', 'operating_system': 'Android', 'operating_system_version': 'Android 11', 'vendor_id': None, 'advertising_id': None, 'language': 'en-us', 'is_limited_ad_tracking': 'No', 'time_zone_offset_seconds': -14400, 'browser': None, 'browser_version': None, 'web_info': None} |
| 1 | 42c91eb15cc375191cbebc15a3b794a8 | {'category': 'mobile', 'mobile_brand_name': 'Samsung', 'mobile_model_name': 'SM-S102DL', 'mobile_marketing_name': 'Galaxy A10e', 'mobile_os_hardware_model': 'SM-S102DL', 'operating_system': 'Android', 'operating_system_version': 'Android 10', 'vendor_id': None, 'advertising_id': None, 'language': 'en-us', 'is_limited_ad_tracking': 'No', 'time_zone_offset_seconds': -14400, 'browser': None, 'browser_version': None, 'web_info': None} |
| 2 | 4a988bee24ba50c9fdc63605c9ece1e2 | {'category': 'mobile', 'mobile_brand_name': 'Infinix', 'mobile_model_name': 'X557', 'mobile_marketing_name': 'Infinix HOT 4', 'mobile_os_hardware_model': 'Infinix HOT 4', 'operating_system': 'Android', 'operating_system_version': 'Android 6.0', 'vendor_id': None, 'advertising_id': None, 'language': 'fr-fr', 'is_limited_ad_tracking': 'No', 'time_zone_offset_seconds': 0, 'browser': None, 'browser_version': None, 'web_info': None} |
| 3 | 4a988bee24ba50c9fdc63605c9ece1e2 | {'category': 'mobile', 'mobile_brand_name': 'Infinix', 'mobile_model_name': 'X557', 'mobile_marketing_name': 'Infinix HOT 4', 'mobile_os_hardware_model': 'Infinix HOT 4', 'operating_system': 'Android', 'operating_system_version': 'Android 6.0', 'vendor_id': None, 'advertising_id': None, 'language': 'fr-fr', 'is_limited_ad_tracking': 'No', 'time_zone_offset_seconds': 0, 'browser': None, 'browser_version': None, 'web_info': None} |
| 4 | 4a988bee24ba50c9fdc63605c9ece1e2 | {'category': 'mobile', 'mobile_brand_name': 'Infinix', 'mobile_model_name': 'X557', 'mobile_marketing_name': 'Infinix HOT 4', 'mobile_os_hardware_model': 'Infinix HOT 4', 'operating_system': 'Android', 'operating_system_version': 'Android 6.0', 'vendor_id': None, 'advertising_id': None, 'language': 'fr-fr', 'is_limited_ad_tracking': 'No', 'time_zone_offset_seconds': 0, 'browser': None, 'browser_version': None, 'web_info': None} |
Let's extract relevant data from 'device' column
def expand_dict(df, column, param):
df[column] = df[param].apply(lambda x: x[column])
expand_dict(devices,'category', 'device')
expand_dict(devices,'mobile_brand_name', 'device')
expand_dict(devices,'operating_system_version', 'device')
expand_dict(devices,'language', 'device')
devices=devices.drop(labels='device', axis=1)
get_info(devices)
Head:
| user_pseudo_id | category | mobile_brand_name | operating_system_version | language | |
|---|---|---|---|---|---|
| 0 | b01d6b0818c608ce3cb537401b84c5ae | mobile | Samsung | Android 11 | en-us |
| 1 | 42c91eb15cc375191cbebc15a3b794a8 | mobile | Samsung | Android 10 | en-us |
| 2 | 4a988bee24ba50c9fdc63605c9ece1e2 | mobile | Infinix | Android 6.0 | fr-fr |
| 3 | 4a988bee24ba50c9fdc63605c9ece1e2 | mobile | Infinix | Android 6.0 | fr-fr |
| 4 | 4a988bee24ba50c9fdc63605c9ece1e2 | mobile | Infinix | Android 6.0 | fr-fr |
Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5758739 entries, 0 to 5758738 Data columns (total 5 columns): # Column Dtype --- ------ ----- 0 user_pseudo_id object 1 category object 2 mobile_brand_name object 3 operating_system_version object 4 language object dtypes: object(5) memory usage: 219.7+ MB
None
Description of quantitative parameters:
| user_pseudo_id | category | mobile_brand_name | operating_system_version | language | |
|---|---|---|---|---|---|
| count | 5758739 | 5758739 | 5700656 | 5758739 | 5758739 |
| unique | 83912 | 2 | 195 | 94 | 450 |
| top | ededd30e5bf41b0df785f14373c06b58 | mobile | Samsung | Android 10 | fa-ir |
| freq | 39906 | 5596078 | 2908286 | 2618395 | 2946445 |
Description of categorical parameters:
| user_pseudo_id | category | mobile_brand_name | operating_system_version | language | |
|---|---|---|---|---|---|
| count | 5758739 | 5758739 | 5700656 | 5758739 | 5758739 |
| unique | 83912 | 2 | 195 | 94 | 450 |
| top | ededd30e5bf41b0df785f14373c06b58 | mobile | Samsung | Android 10 | fa-ir |
| freq | 39906 | 5596078 | 2908286 | 2618395 | 2946445 |
Columns with NaN values:
user_pseudo_id 0 category 0 mobile_brand_name 58083 operating_system_version 0 language 0 dtype: int64
Shape:
(5758739, 5)
Number of duplicated rows:
5671652
devices['mobile_brand_name']=devices['mobile_brand_name'].fillna(value='unknown')
devices.drop_duplicates(inplace=True)
geo.head()
| user_pseudo_id | geo | |
|---|---|---|
| 0 | 1DA3967CD6764C5DB22BA97B7A997D24 | {'continent': 'Oceania', 'country': 'Australia', 'region': 'Victoria', 'city': 'Melbourne', 'sub_continent': 'Australasia', 'metro': '(not set)'} |
| 1 | 1DA3967CD6764C5DB22BA97B7A997D24 | {'continent': 'Oceania', 'country': 'Australia', 'region': 'Victoria', 'city': 'Melbourne', 'sub_continent': 'Australasia', 'metro': '(not set)'} |
| 2 | 5B568F76526648E7AAB6C2641D84EBDB | {'continent': 'Asia', 'country': 'China', 'region': 'Guangdong Province', 'city': 'Guangzhou', 'sub_continent': 'Eastern Asia', 'metro': '(not set)'} |
| 3 | 5B568F76526648E7AAB6C2641D84EBDB | {'continent': 'Asia', 'country': 'China', 'region': 'Guangdong Province', 'city': 'Guangzhou', 'sub_continent': 'Eastern Asia', 'metro': '(not set)'} |
| 4 | 5B568F76526648E7AAB6C2641D84EBDB | {'continent': 'Asia', 'country': 'China', 'region': 'Guangdong Province', 'city': 'Guangzhou', 'sub_continent': 'Eastern Asia', 'metro': '(not set)'} |
expand_dict(geo,'country', 'geo')
expand_dict(geo,'city', 'geo')
geo=geo.drop(labels='geo', axis=1)
get_info(geo)
Head:
| user_pseudo_id | country | city | |
|---|---|---|---|
| 0 | 1DA3967CD6764C5DB22BA97B7A997D24 | Australia | Melbourne |
| 1 | 1DA3967CD6764C5DB22BA97B7A997D24 | Australia | Melbourne |
| 2 | 5B568F76526648E7AAB6C2641D84EBDB | China | Guangzhou |
| 3 | 5B568F76526648E7AAB6C2641D84EBDB | China | Guangzhou |
| 4 | 5B568F76526648E7AAB6C2641D84EBDB | China | Guangzhou |
Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5760198 entries, 0 to 5760197 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 user_pseudo_id object 1 country object 2 city object dtypes: object(3) memory usage: 131.8+ MB
None
Description of quantitative parameters:
| user_pseudo_id | country | city | |
|---|---|---|---|
| count | 5760198 | 5760198 | 5745482 |
| unique | 83928 | 209 | 5350 |
| top | ededd30e5bf41b0df785f14373c06b58 | Iran | |
| freq | 39906 | 2596024 | 1155460 |
Description of categorical parameters:
| user_pseudo_id | country | city | |
|---|---|---|---|
| count | 5760198 | 5760198 | 5745482 |
| unique | 83928 | 209 | 5350 |
| top | ededd30e5bf41b0df785f14373c06b58 | Iran | |
| freq | 39906 | 2596024 | 1155460 |
Columns with NaN values:
user_pseudo_id 0 country 0 city 14716 dtype: int64
Shape:
(5760198, 3)
Number of duplicated rows:
5636198
# replace NaN with '{}' if the column is strings, otherwise replace with {}
# df.Pollutants = df.Pollutants.fillna('{}') # if the NaN is in a column of strings
traffic.traffic_source = traffic.traffic_source.fillna({i: {} for i in traffic.index}) # if the column is not strings
# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True) # uncomment if needed
# normalize the column of dictionaries and join it to df
traffic = traffic.join(pd.json_normalize(traffic.traffic_source))
# drop Pollutants
traffic.drop(columns=['traffic_source'], inplace=True)
traffic.head()
| user_pseudo_id | name | medium | source | |
|---|---|---|---|---|
| 0 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 1 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 2 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 3 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 4 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta |
get_info(traffic)
Head:
| user_pseudo_id | name | medium | source | |
|---|---|---|---|---|
| 0 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 1 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 2 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 3 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta | ||
| 4 | 0cbfacb17dac3d66a049c15a243cbcf3 | insta |
Info: <class 'pandas.core.frame.DataFrame'> RangeIndex: 5761161 entries, 0 to 5761160 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 user_pseudo_id object 1 name object 2 medium object 3 source object dtypes: object(4) memory usage: 175.8+ MB
None
Description of quantitative parameters:
| user_pseudo_id | name | medium | source | |
|---|---|---|---|---|
| count | 5761161 | 578613 | 5691688 | 5692543 |
| unique | 83939 | 10 | 14 | 14 |
| top | ededd30e5bf41b0df785f14373c06b58 | (direct) | organic | google-play |
| freq | 39906 | 560990 | 5106850 | 5092546 |
Description of categorical parameters:
| user_pseudo_id | name | medium | source | |
|---|---|---|---|---|
| count | 5761161 | 578613 | 5691688 | 5692543 |
| unique | 83939 | 10 | 14 | 14 |
| top | ededd30e5bf41b0df785f14373c06b58 | (direct) | organic | google-play |
| freq | 39906 | 560990 | 5106850 | 5092546 |
Columns with NaN values:
user_pseudo_id 0 name 5182548 medium 69473 source 68618 dtype: int64
Shape:
(5761161, 4)
Number of duplicated rows:
5676963
events.head()
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_pseudo_id | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-06 | 2021-06-06 13:45:50 | scr_breathe_home | 2021-04-05 07:27:21 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 1 | 2021-06-06 | 2021-06-06 13:45:52 | scr_breathe | 2021-04-05 07:27:22 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 2 | 2021-06-06 | 2021-06-06 13:46:03 | scr_breathe_home | 2021-06-06 13:45:49 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 3 | 2021-06-06 | 2021-06-06 13:46:03 | scr_AssetsLoaderPage | 2021-06-06 13:45:41 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
| 4 | 2021-06-06 | 2021-06-06 13:46:03 | scr_training_details | 2021-06-06 13:45:39 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 |
print('We have events from', events['event_date'].min(), 'to', events['event_date'].max())
We have events from 2021-04-05 00:00:00 to 2021-06-06 00:00:00
print('We have', len(events) ,'events from', len(events['user_pseudo_id'].unique()), 'users')
We have 5710245 events from 83884 users
print('We have', len(events['event_name'].unique()),'different types of events')
We have 181 different types of events
Let's define several most common user scenarios and investigate those.
Then different trainings:
scr_ballshome scr_ballsgame result_game
select_training_qh0 - Manage your stress start_training_qh0
? scr_intro2 User was on screen intro 2 intro_next_btn_tapped on intro2 the user clicked on the 'continue' button
After trainings user should get to the page with a survey "What you feel?": scr_whatyoufeel
events_count=events.pivot_table(index='event_name', values='user_pseudo_id', aggfunc={'count','nunique'})
events_count=events_count.reset_index()
events_count.sort_values('count', ascending=False).head(60)
| event_name | count | nunique | |
|---|---|---|---|
| 49 | scr_home | 619776 | 63033 |
| 68 | scr_training_details | 549421 | 51972 |
| 180 | user_engagement | 386238 | 71239 |
| 44 | scr_brain_progress | 307415 | 32459 |
| 76 | screen_view | 238296 | 72456 |
| 46 | scr_breathe_home | 228826 | 37925 |
| 90 | session_start | 213766 | 72598 |
| 43 | scr_ballshome | 206274 | 35839 |
| 45 | scr_breathe | 205833 | 34994 |
| 42 | scr_ballsgame | 200244 | 35405 |
| 64 | scr_stress_level | 180286 | 29606 |
| 34 | result_game | 92451 | 30548 |
| 33 | result_breathe | 87323 | 25110 |
| 81 | select_training_qh0 | 79288 | 35443 |
| 72 | scr_unlock_start | 78620 | 18865 |
| 74 | scr_wiki | 77476 | 25599 |
| 65 | scr_survey | 74137 | 50913 |
| 39 | scr_GongMeditationPage | 70671 | 16010 |
| 108 | survey_start | 67873 | 48747 |
| 63 | scr_statistics | 64632 | 26461 |
| 38 | scr_AssetsLoaderPage | 63473 | 9461 |
| 67 | scr_survey_result | 61397 | 42781 |
| 54 | scr_meditationhome | 58607 | 15898 |
| 51 | scr_intro2 | 58137 | 50928 |
| 97 | start_training_qh0 | 56311 | 28506 |
| 75 | scr_wikicontents | 55559 | 25793 |
| 66 | scr_survey_list | 55089 | 24657 |
| 53 | scr_meditation | 54231 | 15699 |
| 73 | scr_whatyoufeel | 52427 | 13086 |
| 107 | survey_end | 51703 | 42577 |
| 26 | intro_next_btn_tapped | 51079 | 48551 |
| 23 | first_open | 50680 | 50624 |
| 69 | scr_training_list | 48842 | 20550 |
| 22 | firebase_campaign | 47929 | 47790 |
| 2 | app_remove | 41214 | 41142 |
| 88 | select_training_qh_breathe | 37070 | 19931 |
| 5 | assets_download_true | 35757 | 34765 |
| 84 | select_training_qh3 | 34765 | 18496 |
| 55 | scr_more | 33706 | 18844 |
| 35 | result_meditation | 33018 | 14505 |
| 62 | scr_start_finish_challenge | 30426 | 9391 |
| 104 | start_training_qh_breathe | 29425 | 17003 |
| 85 | select_training_qh4 | 28487 | 15990 |
| 83 | select_training_qh2 | 25873 | 16246 |
| 36 | result_session | 24661 | 12605 |
| 37 | result_what_you_feel | 23156 | 12075 |
| 87 | select_training_qh6 | 22960 | 14117 |
| 4 | assets_download_false | 22588 | 10337 |
| 101 | start_training_qh4 | 21095 | 13196 |
| 82 | select_training_qh1 | 20547 | 12930 |
| 99 | start_training_qh2 | 19570 | 13641 |
| 100 | start_training_qh3 | 19295 | 9701 |
| 13 | finish_training_qh0 | 18723 | 10625 |
| 47 | scr_bs_meditation | 18702 | 9413 |
| 98 | start_training_qh1 | 18693 | 13102 |
| 106 | survey_close_pressed | 18133 | 15367 |
| 120 | unlock_click_for_free_qh0 | 17936 | 13159 |
| 171 | unlock_start_qh0 | 16351 | 12284 |
| 86 | select_training_qh5 | 16334 | 10850 |
| 103 | start_training_qh6 | 15855 | 10617 |
Calculate the proportion of users who performed the action at least once
events_count['share'] = round((events_count['nunique'] / events['user_pseudo_id'].nunique())*100)
events_count=events_count.sort_values('share', ascending=False)
fig=px.bar(events_count.head(10), x='event_name', y='share', title='Events share')
fig.show()
breathe_training=['scr_breathe_home', 'scr_breathe', 'result_breathe', 'scr_whatyoufeel']
events_breathe = events_count.query('event_name in @breathe_training')
fig = go.Figure(
go.Funnel(
y = events_breathe['event_name'],
x = events_breathe['nunique'],
marker = {"color": 'steelblue'},
textposition = "inside",
textinfo = "value+percent initial"),
go.Layout(title = 'Event funnel: breathing exercise')
)
fig.show()
# pio.write_html(fig, file='event_funnel_breathe.html', auto_open=True)
balls_game=['scr_ballshome', 'scr_ballsgame', 'result_game']
events_balls = events_count.query('event_name in @balls_game')
fig = go.Figure(
go.Funnel(
y = events_balls['event_name'],
x = events_balls['nunique'],
marker = {"color": 'steelblue'},
textposition = "inside",
textinfo = "value+percent initial"),
go.Layout(title = 'Event funnel: balls game')
)
fig.show()
# import plotly.io as pio
# pio.write_html(fig, file='figure.html', auto_open=True)
mbsc_training=['select_training_qh0', 'start_training_qh0', 'scr_stress_level','finish_training_qh0']
events_mbsc = events_count.query('event_name in @mbsc_training')
fig = go.Figure(
go.Funnel(
y = events_mbsc['event_name'],
x = events_mbsc['nunique'],
marker = {"color": 'steelblue'},
textposition = "inside",
textinfo = "value+percent initial"),
go.Layout(title = 'Event funnel: Manage stress training (MBSC)')
)
fig.show()
# import plotly.io as pio
# pio.write_html(fig, file='event_funnel_mbsc.html', auto_open=True)
fig = go.Figure(
go.Funnel(
y = events_count['event_name'],
x = events_count['nunique'],
marker = {"color": 'steelblue'},
textposition = "inside",
textinfo = "value+percent initial"),
go.Layout(title = 'Share of users that proceed from each stage to the next')
)
fig.show()
events['week'] = events['event_date'].dt.isocalendar().week
events_per_week = events.pivot_table(index='week', values='event_name', aggfunc='count')
events_per_week.reset_index(inplace=True)
fig=px.bar(events_per_week, x='week', y='event_name', title='Number of events per week')
fig.show()
events.head()
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_pseudo_id | week | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-06-06 | 2021-06-06 13:45:50 | scr_breathe_home | 2021-04-05 07:27:21 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 | 22 |
| 1 | 2021-06-06 | 2021-06-06 13:45:52 | scr_breathe | 2021-04-05 07:27:22 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 | 22 |
| 2 | 2021-06-06 | 2021-06-06 13:46:03 | scr_breathe_home | 2021-06-06 13:45:49 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 | 22 |
| 3 | 2021-06-06 | 2021-06-06 13:46:03 | scr_AssetsLoaderPage | 2021-06-06 13:45:41 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 | 22 |
| 4 | 2021-06-06 | 2021-06-06 13:46:03 | scr_training_details | 2021-06-06 13:45:39 | 0.0 | 30 | 874711 | deb456bc2ea690aa5544c7f056e46747 | 22 |
events_per_day = events.pivot_table(index='event_date', values='event_name', aggfunc='count')
events_per_day.reset_index(inplace=True)
fig=px.line(events_per_day, x='event_date', y='event_name', title='Number of events per day')
fig.show()
users_per_day = events.pivot_table(index='event_date', values='user_pseudo_id', aggfunc='nunique')
users_per_day.reset_index(inplace=True)
fig=px.line(users_per_day, x='event_date', y='user_pseudo_id', title='Number of users per day')
fig.show()
# df1 = df.melt(id_vars=['Date']+list(df.keys()[5:]), var_name='AAPL')
# px.line(df1, x='Date', y='value', color='AAPL' )
users_per_day.head()
| event_date | user_pseudo_id | |
|---|---|---|
| 0 | 2021-04-05 | 58 |
| 1 | 2021-04-06 | 461 |
| 2 | 2021-04-07 | 3703 |
| 3 | 2021-04-08 | 3488 |
| 4 | 2021-04-09 | 3558 |
fig=px.line(events_per_day, x='event_date', y='event_name', title='Number of events and users per day', log_y=True)
# Only thing I figured is - I could do this
fig.add_scatter(x=users_per_day['event_date'], y=users_per_day['user_pseudo_id'], name='users')
# Show plot
fig.show()
# pio.write_html(fig, file='users_events.html', auto_open=False)
events_per_user = events.groupby(['week']).agg({'user_pseudo_id': ['count','nunique']})
events_per_user.columns = ['events', 'users']
events_per_user['events_per_user'] = (events_per_user['events'] / events_per_user['users']).round(2)
events_per_user=events_per_user.reset_index()
fig = px.bar(events_per_user, x="week", y='events_per_user', title="Average number of events per user by week")
fig.show()
events['month'] = events['event_timestamp'].astype('datetime64[M]')
dau_total = events.groupby('event_date').agg({'user_pseudo_id': 'nunique'}).mean()
wau_total = events.groupby(['week']).agg({'user_pseudo_id': 'nunique'}).mean()
print('Daily Active Users: {:.0f}'.format(int(dau_total)))
Daily Active Users: 3201
print('Weekly Active Users: {:.0f}'.format(int(wau_total)))
Weekly Active Users: 15519
print('Weekly Sticky factor:', round(int(dau_total) / (int(wau_total)) * 100, 1))
Weekly Sticky factor: 20.6
first_session = events.groupby('user_pseudo_id')['event_timestamp'].min()
first_session.name = 'first_session_week'
retention_rate = events.join(first_session, on='user_pseudo_id')
retention_rate['first_session_week'] = retention_rate['first_session_week'].astype('datetime64[W]')
retention_rate['session_week'] = retention_rate['event_timestamp'].astype('datetime64[W]')
retention_rate['cohort_lifetime'] = retention_rate['session_week'] - retention_rate['first_session_week']
retention_rate['cohort_lifetime'] = retention_rate['cohort_lifetime'] / np.timedelta64(1,'W')
retention_rate['cohort_lifetime'] = retention_rate['cohort_lifetime'].round().astype('int')
cohorts_retention = (retention_rate
.groupby(['first_session_week', 'cohort_lifetime'])
.agg({'user_pseudo_id': 'nunique'})
.reset_index()
)
init_users_count = cohorts_retention[cohorts_retention['cohort_lifetime']==0][['first_session_week', 'user_pseudo_id']]
init_users_count = init_users_count.rename(columns={'user_pseudo_id': 'cohort_users'})
cohorts_retention = cohorts_retention.merge(init_users_count, on='first_session_week')
cohorts_retention['retention'] = cohorts_retention['user_pseudo_id']/cohorts_retention['cohort_users']
retention_pivot_rel = cohorts_retention.pivot_table(index='first_session_week', columns='cohort_lifetime',
values='retention', aggfunc='sum')
retention_pivot_rel = (retention_pivot_rel).round(4)
retention_mean = pd.Series(retention_pivot_rel.apply('mean', axis=0))
retention_mean.name = 'retention_mean'
retention_pivot_rel= retention_pivot_rel.append(retention_mean)
plt.figure(figsize=(13, 9))
plt.title('Retention rate')
sns.heatmap(retention_pivot_rel, vmax="0.4", annot=True, fmt='.1%', linewidths=1, linecolor='gray')
plt.show()
# fig=px.imshow(retention_pivot_rel)
# fig.show()
# import plotly.figure_factory as ff
# fig = ff.create_annotated_heatmap(retention_pivot_rel)
# fig.show()
users.head()
| user_pseudo_id | user_first_touch_timestamp | revenue | currency | |
|---|---|---|---|---|
| 0 | 948ba94877ac58a1cc65c519a2ece1f7 | 2021-06-06 13:45:50 | 0.0 | 0 |
| 1 | a625fb60fdf70d36529d3d7709b02517 | 2021-06-06 13:45:52 | 0.0 | 0 |
| 2 | fe0dc05bfa4086c4df499a482f889c9c | 2021-06-06 13:46:03 | 0.0 | 0 |
| 3 | 697897111de070f8ca650d14441051b6 | 2021-06-06 13:46:03 | 0.0 | 0 |
| 4 | 89f96008d6cbb7651cf0abf6f0bfbb7f | 2021-06-06 13:46:03 | 0.0 | 0 |
users['currency'].unique()
array([0, 'USD'], dtype=object)
print('Overall revenue is', round(users['revenue'].sum(),2), 'USD')
Overall revenue is 886828.2 USD
print('Overall in-app purchase sum is', round(events['event_value_in_usd'].sum(),2), 'USD')
Overall in-app purchase sum is 1315.39 USD
Amount of revenue and in-app purchases differs, may be not all purchases are recorded.
users_ltv=users.query('revenue > 0')
users_ltv
| user_pseudo_id | user_first_touch_timestamp | revenue | currency | |
|---|---|---|---|---|
| 36121 | 25bdac266b76e87bf3cbc78c39f1380d | 2021-05-21 08:57:02 | 2.61699 | USD |
| 36123 | 25bdac266b76e87bf3cbc78c39f1380d | 2021-05-21 08:57:05 | 2.61699 | USD |
| 36124 | 25bdac266b76e87bf3cbc78c39f1380d | 2021-05-21 08:57:09 | 2.61699 | USD |
| 36126 | 25bdac266b76e87bf3cbc78c39f1380d | 2021-05-21 08:57:10 | 2.61699 | USD |
| 36128 | 25bdac266b76e87bf3cbc78c39f1380d | 2021-05-21 08:57:17 | 2.61699 | USD |
| ... | ... | ... | ... | ... |
| 5753200 | 76c218feb50bfe77476e18df58842c6c | 2021-04-14 04:58:46 | 6.98000 | USD |
| 5753202 | 76c218feb50bfe77476e18df58842c6c | 2021-04-14 04:58:53 | 6.98000 | USD |
| 5753204 | 76c218feb50bfe77476e18df58842c6c | 2021-04-14 04:58:59 | 6.98000 | USD |
| 5753206 | 76c218feb50bfe77476e18df58842c6c | 2021-04-14 04:59:01 | 6.98000 | USD |
| 5754397 | 28c7a52af691086d344c6c1b0b114b71 | 2021-04-14 18:00:59 | 1.99000 | USD |
66699 rows × 4 columns
users_ltv['week']=users_ltv['user_first_touch_timestamp'].astype('datetime64[W]')
first_orders = users_ltv.groupby('user_pseudo_id').agg({'week': 'min'}).reset_index()
first_orders.columns = ['user_pseudo_id', 'first_order_week']
first_orders
<ipython-input-119-9f97d03ba03d>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| user_pseudo_id | first_order_week | |
|---|---|---|
| 0 | 0001e0eff587ecc6cc00000618603263 | 2021-04-08 |
| 1 | 004abf8c00fb304d187fbbcb93c1954e | 2021-04-15 |
| 2 | 0100c99ac335e15ffd2446310f0c58af | 2021-04-08 |
| 3 | 0137d04e61d6957102ff1961b3a0542f | 2021-04-08 |
| 4 | 01e904effe825686ba520131539946a9 | 2021-04-08 |
| ... | ... | ... |
| 461 | fd962eb58e9d5cfac8e2ce438d984d68 | 2021-04-22 |
| 462 | fe69a4d7c022f6cd1160568a2d57e1d9 | 2021-05-06 |
| 463 | fe858037c616755cc7c87f175d07122e | 2021-04-01 |
| 464 | ff30a872f4ffff7be05235122f1fb740 | 2021-04-01 |
| 465 | ff49f7e265bdcd5aa42ee75a8e42b412 | 2021-05-13 |
466 rows × 2 columns
cohort_sizes = first_orders.groupby('first_order_week').agg({'user_pseudo_id': 'nunique'}).reset_index()
cohort_sizes.columns = ['first_order_month', 'n_byers']
# cohort_sizes
users_ltv.dropna(how='any', inplace=True)
<ipython-input-122-fc7ea6023a0c>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
users_ltv['user_first_touch_timestamp'].isna().sum()
0
first_session = users_ltv.groupby('user_pseudo_id')['user_first_touch_timestamp'].min()
first_session.name = 'first_session_week'
ltv = users_ltv.join(first_session, on='user_pseudo_id')
ltv['first_session_week'] = ltv['first_session_week'].astype('datetime64[W]')
ltv['session_week'] = ltv['user_first_touch_timestamp'].astype('datetime64[W]')
ltv['cohort_lifetime'] = ltv['session_week'] - ltv['first_session_week']
ltv['cohort_lifetime'] = ltv['cohort_lifetime'] / np.timedelta64(1,'W')
ltv['cohort_lifetime'] = ltv['cohort_lifetime'].round().astype('int')
# len(users['user_pseudo_id'].unique())
user_ltv=ltv.groupby(['user_pseudo_id']).agg({'revenue': 'sum'})
# user_ltv
cohorts_ltv = (ltv.groupby(['first_session_week', 'cohort_lifetime'])
.agg({'revenue': 'mean'})
.reset_index()
)
ltv_pivot = cohorts_ltv.pivot_table(index='first_session_week', columns='cohort_lifetime',
values='revenue', aggfunc='mean')
ltv_pivot = (ltv_pivot).round(4)
ltv_pivot=ltv_pivot.cumsum(axis=1)
plt.figure(figsize=(13, 9))
plt.title('LTV')
sns.heatmap(ltv_pivot, annot=True, linewidths=1, linecolor='gray', fmt='.1f')
plt.show()
Let's investigate how many users stopped using the app. This will include users:
As we have just 2 months of data on users' events, let's try to define the period of inactivity after which we can consider the user to be lost. Let's check how much time passed from last event of each user.
last_user_event = events.groupby(['user_pseudo_id']).agg({'event_timestamp': 'max'})
last_user_event=last_user_event.reset_index()
last_user_event['inactivity_days']= (events['event_timestamp'].max() - last_user_event['event_timestamp']).dt.days
days_in_data=(events['event_timestamp'].max() - events['event_timestamp'].min()).days
days_in_data
62
fig = px.histogram(last_user_event, x="inactivity_days", nbins=days_in_data)
fig.show()
print('Number of inactive users', round(len(last_user_event.query('inactivity_days > 10'))))
Number of inactive users 63196
between one event and previous event - for how long people stop using the app and then return
We calculate for each user maximum period of inactivity: create a column 'time_between_events' and find maximum number:
events['time_between_events'] =(events['event_timestamp'] - events['event_previous_timestamp'])
events['time_between_events']=events['time_between_events'].astype('timedelta64[D]')
events.query('user_pseudo_id == "e912bf1973d919a571cc21ada130e6f4"').sort_values('time_between_events')
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_pseudo_id | week | month | time_between_events | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1113308 | 2021-05-06 | 2021-05-05 23:23:02 | user_engagement | 2021-05-05 23:33:54 | 0.0 | 2 | 1700556 | e912bf1973d919a571cc21ada130e6f4 | 18 | 2021-05-01 | -1.0 |
| 1113307 | 2021-05-06 | 2021-05-05 23:23:02 | first_open | 2021-05-05 23:23:02 | 0.0 | 2 | 1700556 | e912bf1973d919a571cc21ada130e6f4 | 18 | 2021-05-01 | 0.0 |
| 3099300 | 2021-05-27 | 2021-05-26 22:42:17 | select_training_ch1 | 2021-05-26 22:42:17 | 0.0 | 13 | 2218317 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 0.0 |
| 3099301 | 2021-05-27 | 2021-05-26 22:42:17 | scr_training_details | 2021-05-26 22:42:05 | 0.0 | 13 | 2218317 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 0.0 |
| 3099302 | 2021-05-27 | 2021-05-26 22:42:20 | start_training_ch1 | 2021-05-26 22:42:20 | 0.0 | 13 | 2218317 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3061361 | 2021-05-27 | 2021-05-26 22:24:22 | scr_stress_level | 2021-05-05 23:38:35 | 0.0 | 12 | 1451748 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 20.0 |
| 3061367 | 2021-05-27 | 2021-05-26 22:27:07 | scr_breathe | 2021-05-05 23:29:38 | 0.0 | 12 | 1451748 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 20.0 |
| 3061368 | 2021-05-27 | 2021-05-26 22:31:10 | result_breathe | 2021-05-05 23:33:41 | 0.0 | 12 | 1451748 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 20.0 |
| 3061363 | 2021-05-27 | 2021-05-26 22:24:42 | scr_ballsgame | 2021-05-05 23:38:35 | 0.0 | 12 | 1451748 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 20.0 |
| 3061360 | 2021-05-27 | 2021-05-26 22:24:22 | start_training_qh0 | 2021-05-05 23:35:55 | 0.0 | 12 | 1451748 | e912bf1973d919a571cc21ada130e6f4 | 21 | 2021-05-01 | 20.0 |
254 rows × 11 columns
inactivity_before_next_event = events.groupby(['user_pseudo_id']).agg({'time_between_events': 'max'})
inactivity_before_next_event=inactivity_before_next_event.reset_index()
# Let's drop users with 0 days of inactivity
inactivity_before_next_event=inactivity_before_next_event.query('time_between_events != 0')
fig = px.histogram(inactivity_before_next_event, x="time_between_events")
fig.show()
# pio.write_html(fig, file='inactivity.html', auto_open=False)
# inactivity_before_next_event['time_between_events'].np.percentile(75, 90)
p = np.percentile(inactivity_before_next_event['time_between_events'], 60)
p
11.0
events_remove=events.query('event_name == "app_remove"')
len(events_remove['user_pseudo_id'].unique())
41142
events_user_remove = events[events['user_pseudo_id'].isin(events_remove['user_pseudo_id'])]
len(events_user_remove['user_pseudo_id'].unique())
41142
len(events_user_remove['user_pseudo_id'].unique())/len(events['user_pseudo_id'].unique()) * 100
49.04630203614515
Share of users who remove the app is around 50% which corresponds to average uninstall rate in mobile app industry.
events_user_remove.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2180486 entries, 26 to 5756074 Data columns (total 11 columns): # Column Dtype --- ------ ----- 0 event_date datetime64[ns] 1 event_timestamp datetime64[ns] 2 event_name object 3 event_previous_timestamp datetime64[ns] 4 event_value_in_usd float64 5 event_bundle_sequence_id int64 6 event_server_timestamp_offset int64 7 user_pseudo_id object 8 week UInt32 9 month datetime64[ns] 10 time_between_events float64 dtypes: UInt32(1), datetime64[ns](4), float64(2), int64(2), object(2) memory usage: 193.4+ MB
events_inactive = last_user_event.query('inactivity_days > 10')
events_inactive.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 63196 entries, 1 to 83883 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 user_pseudo_id 63196 non-null object 1 event_timestamp 63196 non-null datetime64[ns] 2 inactivity_days 63196 non-null int64 dtypes: datetime64[ns](1), int64(1), object(1) memory usage: 1.9+ MB
users_churn = events_inactive.merge(events_remove, how='outer', on='user_pseudo_id')
users_churn.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 69927 entries, 0 to 69926 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 user_pseudo_id 69927 non-null object 1 event_timestamp_x 63256 non-null datetime64[ns] 2 inactivity_days 63256 non-null float64 3 event_date 41214 non-null datetime64[ns] 4 event_timestamp_y 41214 non-null datetime64[ns] 5 event_name 41214 non-null object 6 event_previous_timestamp 41214 non-null datetime64[ns] 7 event_value_in_usd 41214 non-null float64 8 event_bundle_sequence_id 41214 non-null float64 9 event_server_timestamp_offset 41214 non-null float64 10 week 41214 non-null UInt32 11 month 41214 non-null datetime64[ns] 12 time_between_events 41214 non-null float64 dtypes: UInt32(1), datetime64[ns](5), float64(5), object(2) memory usage: 7.3+ MB
print('Share of churn users', round(len(users_churn['user_pseudo_id'].unique()) / len(events['user_pseudo_id'].unique()) * 100), '%')
Share of churn users 83 %
Most active users
devices.head()
| user_pseudo_id | category | mobile_brand_name | operating_system_version | language | |
|---|---|---|---|---|---|
| 0 | b01d6b0818c608ce3cb537401b84c5ae | mobile | Samsung | Android 11 | en-us |
| 1 | 42c91eb15cc375191cbebc15a3b794a8 | mobile | Samsung | Android 10 | en-us |
| 2 | 4a988bee24ba50c9fdc63605c9ece1e2 | mobile | Infinix | Android 6.0 | fr-fr |
| 14 | 0112f26e97e5acc3c5a8b074cccc7756 | mobile | Samsung | Android 8.1.0 | ar-ma |
| 18 | b2c92f29ef588cb2331668631158b22f | mobile | unknown | Android 8.1.0 | pt-br |
devices_count=devices.groupby(['category']).agg('count')
devices_count=devices_count.reset_index()
fig = px.pie(devices_count, values='user_pseudo_id', title='Devices share', names='category')
fig.show()
# !pip install iso_language_codes
# from iso_language_codes import *
# languages = language_dictionary()
# def lang_mapper(row):
# try:
# return languages[row['language']]['Name']
# except:
# if row['language'] == 'zh-cn':
# return 'Chinese (PRC)'
# else:
# return 'Chinese (Taiwan)'
# reviews_clean['language_full'] = reviews_clean.apply(lang_mapper, axis=1)
devices_lang=devices.groupby(['language']).agg('count')
devices_lang=devices_lang.reset_index()
fig = px.bar(devices_lang.sort_values('user_pseudo_id', ascending=False).head(10), x='language', y='user_pseudo_id', title='Popular languages')
fig.show()
devices_brand=devices.groupby(['mobile_brand_name']).agg('count')
devices_brand=devices_brand.reset_index()
fig = px.bar(devices_brand.sort_values('user_pseudo_id', ascending=False).head(10), x='mobile_brand_name', y='user_pseudo_id', title='Popular brands')
fig.show()
def os_mapper(row):
if 'Android' in row['operating_system_version']:
return 'Android'
elif 'iOS' in row['operating_system_version']:
return 'iOS'
else:
return "other"
devices['operating_system'] = devices.apply(os_mapper, axis=1)
devices_os=devices.groupby(['operating_system']).agg('count')
devices_os=devices_os.reset_index()
fig = px.pie(devices_os, values='user_pseudo_id', title='OS share', names='operating_system')
fig.show()
geo.head()
| user_pseudo_id | country | city | |
|---|---|---|---|
| 0 | 1DA3967CD6764C5DB22BA97B7A997D24 | Australia | Melbourne |
| 1 | 1DA3967CD6764C5DB22BA97B7A997D24 | Australia | Melbourne |
| 2 | 5B568F76526648E7AAB6C2641D84EBDB | China | Guangzhou |
| 3 | 5B568F76526648E7AAB6C2641D84EBDB | China | Guangzhou |
| 4 | 5B568F76526648E7AAB6C2641D84EBDB | China | Guangzhou |
geo=geo.drop_duplicates()
len(geo)
124000
len(geo['user_pseudo_id'].unique())
83928
geo['city'].isna().sum()
624
geo['city']=geo['city'].fillna(value='Unknown')
geo_country=geo.groupby(['country']).agg('nunique')
geo_country=geo_country.reset_index()
# geo_country = geo_country.sort_values('user_pseudo_id', ascending=False).head(10)
# fig = px.bar(geo_country, title='Popular OS')
# fig.show()
len(geo_country['country'].unique())
209
fig = px.scatter(geo_country, x="user_pseudo_id", y="country",size="user_pseudo_id", color="country",hover_name="country")
fig.show()
# conda install -c conda-forge pycountry
import pycountry
pycountry.countries
<pycountry.ExistingCountries at 0x7f9e9fd9f550>
def country_mapper(row):
try:
country = pycountry.countries.get(name=row['country'])
return country.alpha_2
except:
if row['country']=='Russia':
return 'RU'
else: return 'IR'
geo_country['iso_alpha'] = geo_country.apply(country_mapper, axis=1)
geo_country
| country | user_pseudo_id | city | iso_alpha | |
|---|---|---|---|---|
| 0 | 152 | 2 | IR | |
| 1 | Afghanistan | 316 | 3 | AF |
| 2 | Albania | 31 | 3 | AL |
| 3 | Algeria | 1058 | 15 | DZ |
| 4 | American Samoa | 1 | 1 | AS |
| ... | ... | ... | ... | ... |
| 204 | Vietnam | 623 | 88 | IR |
| 205 | Western Sahara | 9 | 1 | EH |
| 206 | Yemen | 503 | 4 | YE |
| 207 | Zambia | 4 | 3 | ZM |
| 208 | Zimbabwe | 13 | 2 | ZW |
209 rows × 4 columns
fig = px.scatter_geo(geo_country, locations='country', locationmode='country names',
hover_name="country", size="user_pseudo_id", color='country',
projection="equirectangular", title='Geographical distribution of unique app users')
fig.show()
# pio.write_html(fig, file='geo_distr.html', auto_open=True)
geo_country=geo.groupby(['country', 'city']).agg('nunique')
geo_country=geo_country.reset_index()
geo_country=geo_country.sort_values('user_pseudo_id', ascending=False).head(20)
fig = px.treemap(geo_country, path=['country', 'city'],
values='user_pseudo_id', title='Popular cities')
fig.show()
# import plotly.io as pio
# pio.write_html(fig, file='cities_distr.html', auto_open=True)
traffic.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5761161 entries, 0 to 5761160 Data columns (total 4 columns): # Column Dtype --- ------ ----- 0 user_pseudo_id object 1 name object 2 medium object 3 source object dtypes: object(4) memory usage: 175.8+ MB
traffic=traffic.drop_duplicates()
traffic['source'].unique()
array(['instagram', 'get', 'Selene', nan, 'google-play', 'selene',
'mynorbu', 'google', '(direct)', 'mixer', 'flON', 'flnorbu',
'flonnorbu', None, 'appbrain', 'vc'], dtype=object)
traffic_source=traffic.groupby(['source']).agg('nunique')
traffic_source=traffic_source.reset_index()
fig = px.bar(traffic_source, x='source', y='user_pseudo_id', title='Amount of users from different traffic sources')
fig.show()
# import plotly.io as pio
# pio.write_html(fig, file='traffic_distr.html', auto_open=False)
events=events.merge(traffic, how='right', on='user_pseudo_id')
events.head()
| event_date | event_timestamp | event_name | event_previous_timestamp | event_value_in_usd | event_bundle_sequence_id | event_server_timestamp_offset | user_pseudo_id | week | month | time_between_events | name | medium | source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-04-17 | 2021-04-17 12:25:23 | first_open | 2021-04-17 12:25:23 | 0.0 | 1.0 | 508.0 | 0cbfacb17dac3d66a049c15a243cbcf3 | 15 | 2021-04-01 | 0.0 | insta | ||
| 1 | 2021-04-17 | 2021-04-17 12:25:23 | user_engagement | 2021-04-17 12:25:23 | 0.0 | 1.0 | 508.0 | 0cbfacb17dac3d66a049c15a243cbcf3 | 15 | 2021-04-01 | 0.0 | insta | ||
| 2 | 2021-04-17 | 2021-04-17 12:25:23 | session_start | 2021-04-17 12:25:23 | 0.0 | 2.0 | 509.0 | 0cbfacb17dac3d66a049c15a243cbcf3 | 15 | 2021-04-01 | 0.0 | insta | ||
| 3 | 2021-04-17 | 2021-04-17 12:25:23 | screen_view | 2021-04-17 12:25:23 | 0.0 | 2.0 | 509.0 | 0cbfacb17dac3d66a049c15a243cbcf3 | 15 | 2021-04-01 | 0.0 | insta | ||
| 4 | 2021-04-17 | 2021-04-17 12:25:24 | firebase_campaign | 2021-04-17 12:25:24 | 0.0 | 2.0 | 509.0 | 0cbfacb17dac3d66a049c15a243cbcf3 | 15 | 2021-04-01 | 0.0 | insta |
events=events.merge(geo, how='right', on='user_pseudo_id')
users_per_day_country = events.pivot_table(index={'event_date', 'country'}, values='user_pseudo_id', aggfunc='nunique')
users_per_day_country.reset_index(inplace=True)
users_per_day_country=users_per_day_country.sort_values('user_pseudo_id', ascending=False)
fig=px.area(users_per_day_country, x='event_date', y='user_pseudo_id', color='country', title='Number of users per day from different countries')
fig.show()
# pio.write_html(fig, file='users_geo_distr.html', auto_open=False)
There are some spikes and declines, but overall number of users is ditributed quite evenly.
users_per_day_source = events.pivot_table(index={'event_date', 'source'}, values='user_pseudo_id', aggfunc='nunique')
users_per_day_source.reset_index(inplace=True)
users_per_day_source=users_per_day_source.sort_values('user_pseudo_id', ascending=False)
fig=px.area(users_per_day_source, x='event_date', y='user_pseudo_id', color='source', title='Number of users per day from different sources')
fig.show()
# pio.write_html(fig, file='inactivity.html', auto_open=False)
There are just two traffic sources for now.
Probably, it will make sense to choose another potentially successful traffic source to develop: social or email traffic Source.
Plus, improve existing sources, for example, perform Google Play Optimization (if not yet). Currently, not all key words result in finding Norbu app: meditation, mindfullness, breathing do not work, stress/stress control works